Automated Fault-Injection Based Dependability Analysis of Distributed Computer Systems
Author
Abstract
Recently, there has been interest in developing dependability benchmarks for computer systems. This requires a way to inject several different types of faults into many different platforms and a way to collect and compare the results. Analyzing complex heterogeneous distributed systems has the same needs. The current approach to building fault injection tools is inappropriate for these goals. NFTAPE (Networked Fault Tolerance and Performance Evaluator), a tool for designing and executing fault injection campaigns, makes these goals achievable by using lightweight fault injectors (LWFIs). NFTAPE provides (1) an environment for executing fault injection campaigns using a library of fault injection programs (such as LWFIs, triggers, and monitors), (2) support services for communicating between test nodes, managing processes, and logging results, and (3) a programming model and API for writing support processes, including existing fault injectors. Unlike older fault injectors, NFTAPE can use any fault injection method (software-implemented fault injection, physical fault injection, or simulation). Developing NFTAPE has so far produced a few new variations on fault injection methods: driver-based fault injectors, target-specific fault injectors, and performance faults. NFTAPE has already been used to run fault injection campaigns on several platforms (Linux, Solaris, LynxOS, and Windows) using several fault injection methods (debugger-based, driver-based, target-specific, and special-purpose hardware-based).
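The abstract's central idea is that fault injectors stay small ("lightweight") while triggers, monitors, and campaign control live in separate, reusable components. The Python sketch below illustrates that decomposition on a toy target. It is only an illustration under stated assumptions: the classes `Target`, `BitFlipInjector`, `CountTrigger`, `OutputMonitor` and the function `run_campaign` are hypothetical and are not NFTAPE's actual API, which this abstract does not describe.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class Target:
    """Hypothetical toy system under test: a workload that sums its memory."""
    memory: List[int]

    def run(self) -> int:
        return sum(self.memory)


class Injector(Protocol):
    """Any object with an inject() method can serve as an injector."""
    def inject(self, target: Target) -> None: ...


@dataclass
class BitFlipInjector:
    """Lightweight injector (hypothetical): only flips one bit of one word."""
    address: int
    bit: int

    def inject(self, target: Target) -> None:
        target.memory[self.address] ^= 1 << self.bit


@dataclass
class CountTrigger:
    """Trigger (hypothetical): decides *when* to inject, here on one run index."""
    fire_on: int

    def should_fire(self, run_index: int) -> bool:
        return run_index == self.fire_on


@dataclass
class OutputMonitor:
    """Monitor (hypothetical): classifies a run against a golden result."""
    golden: int

    def classify(self, result: int) -> str:
        return "correct" if result == self.golden else "failure"


def run_campaign(memory: List[int], injector: Injector, trigger: CountTrigger,
                 monitor: OutputMonitor, runs: int) -> Dict[str, int]:
    """Campaign controller: wires the independent pieces together."""
    outcomes: Dict[str, int] = {"correct": 0, "failure": 0}
    for i in range(runs):
        target = Target(list(memory))  # fresh copy so runs are independent
        if trigger.should_fire(i):
            injector.inject(target)
        outcomes[monitor.classify(target.run())] += 1
    return outcomes


if __name__ == "__main__":
    mem = [1, 2, 3, 4]
    result = run_campaign(mem, BitFlipInjector(address=0, bit=3),
                          CountTrigger(fire_on=2), OutputMonitor(golden=sum(mem)),
                          runs=5)
    print(result)  # {'correct': 4, 'failure': 1}
```

Because the controller only calls `inject`, `should_fire`, and `classify`, a different injector (for example, one that models a performance fault) could be swapped in without changing the trigger or monitor. This is an illustrative design choice, not a description of NFTAPE internals.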
Similar Articles
Figure (Loki runtime): System Under Test, Probe, State Machine, State Machine Transport, Fault Parser, Notifications, Recorder, Inject Fault.
Validating distributed systems is particularly difficult, since failures may occur due to a correlated occurrence of faults in different parts of the system. This paper describes the basis for and preliminary implementation of a new fault injector, called Loki, developed specifically for distributed systems. Loki addresses issues related to injecting correlated faults in distributed systems. In...
Testing and Fault Injection of Distributed Protocols
A growing challenge confronting designers and implementors of safety-critical distributed systems is the evaluation and validation of dependability requirements. This paper addresses the problem of testing the fault-tolerance capabilities of distributed protocols. It introduces a general framework for fault injection and testing of distributed systems and describes the ongoing development of a tool...
NFTAPE: A Framework for Assessing Dependability in Distributed Systems with Lightweight Fault Injectors
Many fault injection tools are available for dependability assessment. Although these tools are good at injecting a single fault model into a single system, they suffer from two main limitations for use in distributed systems: (1) no single tool is sufficient for injecting all necessary fault models; (2) it is difficult to port these tools to new systems. NFTAPE, a tool for composing automated ...
Improving Dependability in Service Oriented Architectures using Ontologies and Fault Injection
Large distributed systems and computer grids are increasingly being used in science and in business, with Service Oriented Architectures combined with Web services being the currently favoured solutions for accessing these distributed, heterogeneous resources. However, service-based systems rely heavily on middleware that is continuously evolving. This requires novel methods of testing and evalu...
Verification of cooperative transient fault diagnosis and recovery in critical embedded systems
The faults caused by ambient cosmic radiation are a growing threat to the dependability of advanced embedded computer systems. Maintaining availability and consistency in distributed applications is one of the fundamental attributes in building complex critical systems. To achieve this, a key factor is the ability to detect the fault and handle it by means of recovery. Such systems can use membe...